Near-optimal Individualized Treatment Recommendations

The individualized treatment recommendation (ITR) is an important analytic framework for precision medicine. The goal of ITR is to assign the best treatments to patients based on their individual characteristics. From the machine learning perspective, the solution to the ITR problem can be formulated as a weighted classification problem to maximize the mean benefit from the recommended treatments given patients’ characteristics. Several ITR methods have been proposed in both the binary setting and the multicategory setting. In practice, one may prefer a more flexible recommendation that includes multiple treatment options. This motivates us to develop methods to obtain a set of near-optimal individualized treatment recommendations alternative to each other, called alternative individualized treatment recommendations (A-ITR). We propose two methods to estimate the optimal A-ITR within the outcome weighted learning (OWL) framework. Simulation studies and a real data analysis for Type 2 diabetic patients with injectable antidiabetic treatments are conducted to show the usefulness of the proposed A-ITR framework. We also show the consistency of these methods and obtain an upper bound for the risk between the theoretically optimal recommendation and the estimated one. An R package aitr has been developed, found at https://github.com/menghaomiao/aitr.


Introduction
The individualized treatment recommendation (ITR) has drawn increasing attention in recent years. Because patients may respond differently to the same treatment (Lesko, 2007;Insel, 2009), it is desirable to individualize the treatment according to patients' characteristics. Mathematically, an ITR is a map from such characteristics (the covariates, or features) to a treatment. The goal is to find the optimal treatment so that the average benefit that patients will receive by following such a recommendation is maximized.
In the literature, many statistical approaches have been proposed for solving the ITR problem. In indirect modeling-based methods, one first builds a parametric or semiparametric model to estimate the expected outcome conditional on the covariates, then recommends the single treatment that renders the optimal outcome to the given patient (Robins, 2004;Qian and Murphy, 2011;Schulte et al., 2014). However, these methods require a correct model specification and an accurate estimation to perform well in practice. One may also obtain the optimal ITR directly. Zhao et al. (2012) proposed a classificationbased method, coined as the outcome weighted learning (OWL), to estimate the optimal ITR. They transformed the ITR problem into a weighted classification problem and used support vector machine (SVM), a classification method, to solve it. Built on top of the OWL framework, there has been a rapidly growing literature on different aspects of the ITR problem. Zhao et al. (2014) and Cui et al. (2017) extended the OWL framework to accommodate survival outcomes.  and Liu et al. (2018) proposed residual weighted learning (RWL) and augmented outcome-weighted learning (AOL) respectively to reduce the variability of the weights in OWL to enhance its performance. Chen et al. (2018) proposed generalized OWL (GOWL) to solve an ITR with ordinal treatments. Zhang et al. (2018a) proposed angle-based approach for the multicategory case (in which there are more than two treatments available to choose from). Recently, Zhao et al. (2019) and Huang et al. (2019) considered replacing the weights in OWL with a doubly robust estimator to further improve the robustness of OWL. Methods based on other learning algorithms such as trees (Laber and Zhao, 2015;Kallus, 2016;Doubleday et al., 2018;Zhu et al., 2017) and nearest neighbors Wu et al., 2019) have also been studied. Another example of direct methods is the work by Zhang et al. (2012), which searched for the ITR among a pre-specified class of decision rules that optimized a doubly robust augmented inverse probability weighted estimator of the overall population mean outcome.
Despite the success of these methods in recommending a single "optimal" treatment to patients, a method that can suggest multiple "near-optimal" treatment options to a patient is not fully studied. Such options could be desirable when several treatments have comparable effects. Laber et al. (2014) and Lizotte and Laber (2016) proposed a set-valued dynamic treatment regime. In particular, if there are two treatments available (1 and −1), their set-valued rule may report {1}, {−1}, or {1, −1}. However, this approach is applicable only to cases with two competing outcomes. They would recommend the set {1, −1} if any treatment cannot be proven to be inferior to the other based on both outcomes. On the other hand, they used a regression-based method to estimate the optimal set-valued rule, which may suffer an inconsistency issue if the model is misspecified. Yuan (2015) considered a framework to allow a reject option in ITR estimation based on OWL. However, the method is restricted to the binary case with only two possible treatments.
In this paper, we propose to study the ITR problem in the setting with only one clinical outcome from a new perspective. Different from the previous ITR work, it provides a set of treatment options that are near the optimality and are alternative to each other, which we called alternative individualized treatment recommendations (A-ITR). Specifically, if multiple treatments are expected to result in similar and near-optimal clinical outcomes for the patient, then they are all recommended to the patient (or the physician), who can choose any one of them to use after incorporating other considerations. There are several reasons this approach may be more desirable than a single treatment recommendation. First, since multiple treatments may yield the same or similar outcomes for some patients, the ranking among the near-optimal treatment options may vary randomly. If only the top treatment is reported, such a seemingly "random" recommendation may severely undermine the trust of the physicians and the patients toward the treatment recommendation system. Second, when the expected outcomes for multiple treatments are indistinguishable, it may be both legally and morally inappropriate to withhold such important information from the patients. Third, A-ITRs allow physicians and patients to incorporate other factors into their decision-making process regarding the treatment. These other factors include the healthcare expense, the painfulness of the treatment, the side-effect of the treatment, the life quality and lifestyle, and so on. For example, when two treatments are expected to have similar outcomes, it is reasonable to choose an option that is covered by the insurance, that is less painful, that has less side-effect, and that does not significantly compromise the quality of life. In this sense, conventional ITR methods that only recommend one treatment may prevent patients from making informed decisions about their lives.
We will propose two methods to estimate A-ITR. Parallel to the development of the conventional ITR methods, we first introduce a regression-based plug-in method to estimate the optimal A-ITR, which will serve as the baseline. Within the OWL framework, we propose two classification-based methods. The technical tool we will use is multicategory classification with reject and refine options (Zhang et al., 2018b).
The rest of the paper is organized as follows. In Section 2, we review the background of the ITR and the classification with reject and refine options problems. We then introduce the proposed A-ITR framework and discuss several estimation methods in Section 3. Detailed algorithm and the tuning procedure can be found in Section 4. We demonstrate the empirical performance through simulation studies in Section 5 and a real-world application using Type 2 diabetes mellitus data in Section 6. Theoretical studies of the proposed method can be found in Section 7. Some concluding remarks are given in Section 8. All technical proofs are provided in the supplementary materials.

Individualized Treatment Recommendation
Denote the covariates of a patient by X ∈ X. Each treatment is denoted by a random variable A, where A ∈ A = {1, 2, …, k} (k treatments available.) After assigning treatment to a patient, we observe an outcome Y ∈ ℝ + . Here we assume Y is bounded. Unlike many other ITR methods, we assume smaller Y is preferred due to a small technicality that can allow some computational savings. An individualized treatment recommendation, previously often referred to as an individualized treatment rule, is a map d : X A.
Let Y*(j) denote the potential outcome that would have been observed when treatment j is assigned to the patient with covariates X. The actual observed outcome Y is related to the potential outcomes by Y = ∑ j ∈ A Y * (j)1[A = j]. Define p(A = j | X) as the conditional probability of treatment j given X. We assume the following assumption.
Assumption 1 For any j, Y*(j) is independent of A given X and p(A = j | X) > 0 almost everywhere.
Under Assumption 1, it was shown by Qian and Murphy (2011) and Kallus (2016) that the expected outcome under ITR d is where E d is the expectation under ITR d. Note that p(A|X) is usually known in a randomized trial, while in an observational study it is unknown and needs to be estimated.
Denote μ j = E(Y | X, A = j), for j = 1,…, k. Then the optimal ITR d* is that is, the optimal treatment for a patient has the smallest (the best) expected outcome.
Many methods have been proposed for estimating the optimal ITR. One method is often called "regression and comparison" or Q-learning (Robins, 2004;Qian and Murphy, 2011). One first estimates the conditional mean μ j (x) = E(Y | X = x, A = j) for each treatment j, then the optimal treatment is obtained by plugging the estimators in (2). However, this method relies on the accuracy of the regression model. If the model is mis-specified, the error could be fairly substantial. Another group of methods treat the problem as a classification problem. One example is called outcome weighted learning (OWL) or O-learning (Zhao et al., 2012O-learning (Zhao et al., , 2014O-learning (Zhao et al., , 2019Zhang et al., 2018a). In the OWL framework, by noting (1), we rewrite the ITR solution as which is closely related to a weighted classification problem with weight Y/p(A|X). To overcome the non-continuity and non-convexity of the 0-1 loss, we can replace 1[A = d(X)] by a convex surrogate loss L(A, f(X)) in the empirical counterpart of (3) and solve instead where E n denotes the empirical expectation, and f is a vector-valued function defined on X.
The estimated ITR d is then obtained from f.
The relationship between d and f depends on the choice of the loss function L and how f is defined. Zhao et al. (2012) proposed to replace the 0-1 loss by hinge loss in the binary case (k = 2, A ∈ {1, −1}), that is, L(A, f) = (1 − Af) + , where x + = max(x, 0), and f is a scalar-valued function. In the current setting that a smaller Y is preferred, they could have used L(A, f) = (1 + Af) + . They showed that the optimal ITR can be estimated by d = sign (f). Zhang et al. (2018a) then extended to the multicategory case using a large-margin loss under the angle-based learning framework (Zhang and Liu, 2014). Specifically, define f(x) = (f 1 , …, f k − 1 ) T (x) ∈ ℝ k − 1 and W 1 , …, W k are vertices of a (k − 1)-dimensional simplex with equal pair-wise distances, defined as where 1 k−1 is a (k−1)-dimensional vector with all 1 and e j − 1 ∈ ℝ k − 1 is a vector with the (j−1)th element 1 and 0 elsewhere. They let L(A, f(x)) = ℓ(⟨W A , f(x)⟩), where ℓ is a typical large-margin surrogate loss for binary classification (except that it is increasing instead of decreasing), and ⟨W A , f(x)⟩ denotes the inner product of the vectors W A and f(x). From the geometry point of view, treatment j is represented by vertex j of the simplex, and the angle between f(x) and W j , ∠(W j , f(x)), indicates how far away f(x) is from each of these treatments. The resulting ITR was estimated by d(x) = argmin j ∠(W j , f(x)) = argmax j 〈W j , f(x)〉, that is, the best treatment is the one whose corresponding vertex is closest to f(x) in terms of the angle.

Remark 2
In the ITR literature, one typically assumes that larger values of the outcome Y are preferred, so that instead of minimization, d* maximizes the objective (3), or , which was indeed a weighted classification problem. In this article, recall that we assume smaller values of Y are preferred. As a consequence, 1[A ≠ d(X)] is replaced by 1[A = d(X)] in (3); additionally, the surrogate loss function is flipped with respect to the origin so that it is an increasing function instead of a decreasing function.

Classification with Reject and Refine Options
We aim to provide set-valued recommendations that are near the optimality and are alternative to each other. To this end, we borrow the idea of multicategory classification with reject and refine options as a technical tool. Classification with a reject option has been widely studied. Herbei and Wegkamp (2006) formulated the problem as a minimization problem under the 0-d-1 loss. That is, the loss of a misclassified instance is 1 and the loss of a rejected instance is d, where 0 ≤ d ≤ 1/2. Bartlett and Wegkamp (2008) proposed an estimation procedure under the hinge loss. Yuan and Wegkamp (2010) extended this framework to a broad class of surrogate loss functions. Zhang et al. (2018b) generalized it to the multicategory case.
We first introduce binary classification with reject option. Let (X, A) be a pair of random variable with X ∈ X and class label A ∈ {1, −1}*, and denote p j (x) = p(A = j | X = x) as the conditional class probability given X. The goal is to train a classifier ϕ(x) that produces three possible outputs: 1, −1, and 0. Here 0 stands for a "reject" option, meaning that the classifier refuses to make a prediction based on the information available. Note that the decision "0" can be viewed as a set-valued decision of {1, −1}. Chow (1970) proposed the 0-d 0 -1 loss with corresponding risk function P(ϕ(X) ≠ A, ϕ(X) ≠ 0) + d 0 P(ϕ(X) = 0) and it was shown that the Bayes rule under this risk is Here d 0 ∈ [0, 1/2] controls the cost for refusing to make a classification. Intuitively, we produce the reject option "0" only when both p 1 (x) and p 2 (x) are close to 1/2. Bartlett and Wegkamp (2008) proposed a bent hinge loss to estimate the optimal rule ϕ*. The bent hinge loss is defined as ℓ(u) = max(0, 1 − u, 1 − (1 − d 0 )u/d 0 ), i.e., the common hinge loss with a bent slope at 0. The effect of such bent slope is to shrink f(x) to 0 when p 1 (x) and p 2 (x) are close. For f * (x) = argmin f(x) E(ℓ(Af) | X = x), we have ϕ* = sign(f*).
The situation is much more complicated for multicategory classification. Suppose there are 3 classes, that is, the so-called refine option, which is a set-valued decision with cardinality strictly greater than 1 and less than k. It contains all those class labels which are nearly as plausible as the most plausible class. Zhang et al. (2018b) proposed to use a class of loss functions in conjunction with the angle-based learning framework (Zhang and Liu, 2014) to train a setvalued classifier that can render these different options. We note that both the reject option and the refine option are set-valued decisions, and they are analogous to the set-valued recommendations in this work.

Methodology
In this section, we introduce the framework of alternative individualized treatment recommendations (A-ITR) and propose two methods to estimate the optimal A-ITR.

A-ITR Framework
There are several situations in which ITRs with additional alternative options are desirable. Even with small errors, when several treatments are near the optimality, the ranking of these treatments based on their estimated outcomes may differ from their true ranking. In this case, reporting only one treatment based on the estimated value is problematic. Secondly, when the error in the learning problem is substantially large, the so-called optimal treatment reported by conventional ITRs may lead to an outcome that is much worse than some of the other treatment options. In these situations, recommending a single treatment only adds to the distrust that patients may already have towards such black-box algorithms that they know little about. On the other hand, A-ITR provides a safety net, preventing from committing to a single treatment that is only one out of multiple treatments with similar or indistinguishable outcomes. Morally, as patients are more mindful about their financial responsibility and their quality of life, it is more appropriate to present these alternative options and have the patients themselves to make an informed decision, especially when many of these decisions are life-changing.
An A-ITR is a set-valued map ϕ : X 2 A ∖ ∅. Inspired by the idea of classification with reject and refine options, given a user-predefined number c ≥ 1 that defines the scope of the near-optimal treatments, we formally define the optimal A-ITR as, where μ (1) is the smallest conditional mean outcomes. This optimal A-ITR set contains all the treatment options with μ j close enough to that of the optimal one, up to a multiplicative constant c. Recall that we assume smaller Y is preferred. Note that for certain c and x, ϕ*(x) may contain only one element, that is, the treatment with the smallest mean outcome, which corresponds to the conventional ITR. If it includes all the treatments, it is a non-informative recommendation, analogous to the reject option in set-valued classification problem. Here we call c the near-optimality constant.
physician's judgment about how close is "indistinguishable" and may vary a lot depending on the application. One value of c (say 10) may be appropriate for one clinical outcome but can be too big for another. In practice, c is chosen according to the physician's experience, or by empirical data if available. A possible candidate for c is an estimate to exp[SD(log(Y*) | x)]. Moreover, the physician may also try two or three different c values which may lead to recommendations with varying cardinalities. These recommendations may then be presented to the patient in the order of increasing cardinality. Caution should be exercised when communicating about these new alternative options and the corresponding possible sacrifice to the outcome.

Remark 4
The optimal A-ITR ϕ* defined in (5) is not invariant to addition but is invariant to multiplication. If it were invariant to addition, then the assumption of non-negativity of Y would be moot. In practice, many clinical outcome are positive (i.e., year of survival, blood count, etc.). A negative clinical outcome Y (i.e., a decrease of blood pressure) may be transformed to be positive, for example, by Y = exp(Y ). In any case, a transformation of the original clinical outcome may be needed to adapt the definition (5) to the specific clinical context. For example, in certain clinical contexts, it could make more sense to define the optimal treatment options as those with expected outcomes less than μ (1) + b where μ (1) is the smallest mean outcome (in the original, untransformed scale). In this case, we may define Y = exp(Y ) so that the optimal treatment options are those with expected (transformed) outcome less than μ (1) × c with c = exp(b). See Section 6 for a real data example in which some data transformation is done.

Estimation
We consider two types of methods to estimate the optimal A-ITR: the regression-based methods and the classification-based methods. For regression-based methods, we can use Qlearning to first estimate the conditional mean μ j (x) = E(Y | X = x, A = j) for each treatment j, then plug into (5), i.e., ϕ(x) = {j | μ j (x) ∕ μ (1) (x) ≤ c}. The success of this regression-based plug-in method relies on accurate estimation of μ j .
In contrast, the classification-based method targets on estimating the true boundary between different decision regions, bypassing the need to estimate μ j directly. In the rest of the section, we propose two classification-based methods within the OWL framework, both of which are based on the angle-based learning approach (Zhang and Liu, 2014). Zhang et al. (2018a) first made use of the angle-based learning approach to solve the ITR problem, in which they denoted W 1 …, W k as the vertices of a (k − 1)-dimensional simplex and they chose the loss L(A, f(x)) in (4) to be a function that only depends on the inner product Define f* to be the population minimizer under such loss, that is The end product of Zhang et al. (2018a) was a single-treatment ITR. In the ideal case that f* can be obtained, their ITR was defined as d f* (x) = argmax j∈{1,…,k} ,⟨W j , f*(x)⟩, and it can be shown that as long as ℓ is convex and strictly increasing, Fisher consistency holds, i.e., d f* = d*. In practice, given the training data set {(x i , a i , y i )} i = 1 n , f, the estimate of f*, is obtained by, where ℱ ⊆ {f | X ℝ k − 1 } is a class of functions, and J(f) is a penalty term to prevent overfitting.
Both our proposed A-ITR methods are derived from the empirical minimizer f in (7) with an aim to estimate the population minimizer f* in (6). The difference lies in the loss function ℓ they use, and how they convert f or f* to the final set-valued recommendations.

Two-step OWL METHOD-For
the two-step method, we use a convex, differentiable, and increasing loss function ℓ D . Given any f (which may be f* or f), to obtain the A-ITR, it is instrumental to first order the vertices W j 's, j = 1,…, k, which represent the k treatments, in the manner of reversed order statistics, ⟨W (1) , f⟩ > ⋯ > ⟨W (k) , f⟩. It turns out (see Proposition 5 below) that when f = f*, the jth reversed order statistic (i.e., the jth largest) ⟨W (j) , f⟩ corresponds to the jth order statistic (the jth smallest) μ (j) where μ (1) < ⋯ < μ (k) .
The resultant two-step estimator of the optimal A-ITR is then defined as Here ℓ D ′ is the first derivative of ℓ D , and the superscript "D" indicates that f is the solution based on a differentiable loss function. Our estimator is motivated by the following result.
Proposition 5 (Zhang et al., 2018a) Let f* be the population minimizer in (6) in which ℓ is a convex and differentiable function ℓ D with ℓ D ′ (u) > 0 for all u. For any i ≠ j ∈ {1,…, k}, we Proposition 6 Let f* be the population minimizer in (6) in which ℓ is a convex and differentiable function ℓ D with ℓ D ′ (u) > 0 for all u. The ϕ f * D defined in (8) based on f* coincides with the optimal A-ITR ϕ* in (5).
This method is a two-step procedure because it first estimates f* using f, then estimates the ratios of conditional means μ j /μ i using ℓ D ′ (〈W i , f〉) ∕ ℓ D ′ (〈W j , f〉) which is then plugged into (5) to obtain the A-ITR. Note that it does not estimate each conditional mean individually, but their ratios. The issue remains that if f* is not accurately estimated, then the ratio μ j /μ i cannot be accurately estimated.

One-step OWL Method-
The one-step method aims to directly obtain a set-valued recommendation without calculating ℓ D ′ (〈W i , f〉) ∕ ℓ D ′ (〈W j , f〉). The crucial difference here is the use of a bent loss function, defined as where ℓ 1 > 0 is a convex and increasing function with ℓ 1 ′ (u) = 1 for all u ≥ 0, and ℓ 2 (u) = (c − 1)u + with c ≥ 1. Such a loss function is bent at 0, since ℓ B ′ (0 − ) = 1 and ℓ B ′ (0 + ) = c. Note that the slope c is the same as the near-optimality constant as defined in (5). An example of bent loss is the bent hinge loss, ℓ B (u) = (1 + u) + + (c − 1)u + (see Figure 1.) The bent loss has been a critical tool that helps to achieve reject (and refine) options in the classification literature (Bartlett and Wegkamp, 2008;Zhang et al., 2018b).
The main effect of the bent loss is to shrink the angle margin for class j (or treatment j here), defined as ⟨W j , f(x)⟩, towards 0, similar to the shrinkage effect of the lasso penalty. Likewise, the additional slope c − 1 for u > 0 is analogous to a penalty parameter in lasso regression, which would encourage a sparse model. Note that here such a shrinkage effect is applied to the classes/treatments with positive angle margins only. Specifically, Proposition 7 below, derived from Proposition 1 in Zhang et al. (2018b), gives the precise values of the angle margins ⟨W j , f(x)⟩'s with respect to f*, the population minimizer of (6) with the bent loss ℓ B .
A direct consequence of Proposition 7 is that all the near-optimal treatments (defined as μ (j) /μ (1) ≤ c) have non-negative angle margins, while the rest have negative angle margins.
This naturally leads to the following set-valued recommendation, Meng et al.
Page 10 Here f can be the population minimizer f* (6) or the empirical minimizer f (7) and the superscript "B" indicates that the loss ℓ is a bent loss ℓ B , as opposed to a differentiable loss function in the two-step method.
following assumption is necessary to resolve the identifiable issue of (9) and to show its optimality.
Assumption 8 guarantees the two sets, {x ∈ X | μ j (x) ∕ μ i (x) = 1} and {x ∈ X | μ j (x) ∕ μ i (x) = c}, where c is the near-optimality constant in (5), have measure 0 for any i ≠ j so that f* is identifiable almost everywhere. Under Assumption 8, we have the following proposition, analogous to Fisher consistency in classification.
Proposition 9 Suppose Assumption 8 holds. Let f* be the population minimizer in (6) in which ℓ is a convex and increasing function ℓ B with ℓ B ′ (0 − ) = 1 and ℓ B ′ (0 + ) = c ≥ 1. The A-ITR ϕ f * B defined in (9) based on f* coincides with the optimal A-ITR ϕ* (5) with the near-optimality constant c.
While Assumption 8 is useful as a technical assumption, it may not hold in certain practical situations. For example, when two treatments have the same conditional mean outcomes for a group of patients, Assumption 8 does not hold for c = 1. Another case that it is more likely to fail is when the outcome Y can only take finite and discrete values. Even if it does not hold, the proposed methods could still be useful. See Example 3 in Section 5 in which Assumption 8 is violated.
Note that for both classification-based methods, a single-valued ITR can be easily defined by recommending the treatment option with the largest angle margin, that is, argmax j∈{1,…,k} ⟨W j , f(x)⟩.
Unlike the regression-based method, the two classification-based methods do not estimate the conditional mean outcome. The success of the regression-based method relies on an accurate estimation of μ j at every x of interest, while reasonable performance is expected for the classification-based methods as long as the estimation is accurate around the "boundaries". However, the two-step method and the one-step method seem to have different focuses. Both methods start with finding a discriminant function to minimize the outcomeweighted misclassification rate for the purpose of minimizing the expected outcome. As a consequence, both methods have "good" performances near boundaries that distinguish the optimal treatment from the non-optimal treatments for each patient. The one-step method, additionally, uses a bent loss with a shrinkage effect that is capable of determining whether a treatment is close enough to, not whether it is equal to, the optimal treatment. More precisely speaking, the one-step method calibrates the boundary defined by μ j (x)/μ (1) (x) = c. This is theoretically justified by Proposition 7. Hence, the one-step method also has "good" performance near such new notions of boundaries.
To illustrate the additional strength of the one-step method, we show the boundaries between recommendations for a toy example (the details of which will be revisited in the numerical studies) in Figure 2, in which the top row shows the single-valued ITR and the second row the set-valued A-ITR, by the Bayes rule, the two-step method, and the one-step method respectively. Both classification-based methods give good approximations to the Bayes ITR boundaries, shown in the top row. However, the two-step method seems to include more treatments into the near-optimal set when compared to the Bayes rule (shown in the bottom row), than the one-step method does. For example, the two-step estimator displays much more recommendations with 2 or 3 treatment options. This is probably due to the fact that the optimization for the two-step method is not designed to capture this subtle pattern, at least not with a finite sample.

Implementations
In this section, we discuss various aspects of the implementations for the proposed methods, including the optimization, the normalization of the predictive function, and the parameter tuning.

Algorithm
In this section, we introduce the optimization procedure to estimate f* defined in (6). Instead of the constrained problem (7), we solve the regularized problem: where λ is a tuning parameter. It is a weighted classification problem with weight w i = y i /p(a i |x i ).
In terms of the function class ℱ, there are linear learning and kernel learning (Steinwart et al., 2007;Hofmann et al., 2008;Hastie et al., 2009). Let f = (f 1 , …, f k − 1 ) T ∈ ℱ, and for simplicity, we add a constant term to x. Then for linear learning, we have f j (x) = x T β j , and the corresponding penalty J(f) = ∑ j = 1 k − 1 ‖β j ‖ 2 = ∑ j = 1 k − 1 β j T β j . For kernel learning, is a kernel function. The penalty term becomes J(f) = ∑ j = 1 k − 1 α j T Kα j + ∑ j = 1 k − 1 α 0j 2 , where K is the gram matrix. Note that we include the intercept term into J(f) and a benefit by doing this is the reduction of the complexity of the algorithm. Zhang et al. (2016) shows theoretically that it can achieve the same convergence rate as the case without the intercept term.
We proposed the two-step method and the one-step method. The two-step method is based on a differentiable loss ℓ D , while the one-step method is based on a bent loss ℓ B (u) = ℓ 1 (u) + ℓ 2 (u), where ℓ 1 is convex and ℓ 2 (u) = (c − 1)u + . Since ℓ D is similar to a special case of ℓ B with section, we use linear learning to demonstrate our algorithm and have deferred the details about kernel learning to the supplementary materials.
We first consider the case when ℓ 1 is differentiable. In this case, we use the ADMM (Boyd et al., 2011) algorithm to solve (10). The ADMM algorithm is used when the objective function can be written as a sum of two convex functions, which, in our case, are ℓ 1 and ℓ 2 .
At step t, for each j = 1,…, k − 1 we can update B t , G t and Z t as until matrix B converges. Note that in the two-step method where c = 1, we have ℓ 2 (u) = 0. In this case, we can force B = G and only update β j t 's until they converge.
Next we consider the case when ℓ 1 is not differentiable. In the literature of classification, a non-differentiable loss that has been commonly used is hinge loss. Note that in our case, since we prefer smaller outcomes, we define the hinge loss as ℓ 1 (u) = (1 + u) + (see Figure 1).
That is, we flip the traditional hinge loss with respect to the y-axis to make it an increasing function. A typical approach to an optimization problem with the hinge loss is to transform it into a quadratic programming (QP) problem in its duality (Fung and Mangasarian, 2005;Hastie et al., 2009;Zhang et al., 2018b). Specifically, the dual problem of (10) can be written as where β j = − 1 nλ ∑ i = 1 n (α i + (c − 1)γ i )W ai, j x i , and W a i ,j is the jth component of W a i . Note that objective function is quadratic in α i and γ i , it has explicit solution at each iteration. Thus it converges very fast by using algorithms such as coordinate decent (Zhang et al., 2018b).
In practice, there may be numerical errors to the solution. Moreover, due to different choices of the tuning parameter λ, the scale of the resulting angle margins may vary much between different tuning trials. We propose the following normalization procedure for the one-step A-ITR ϕ f B (9) to boost the empirical performance. The idea is that instead of recommending all treatments with angle margins greater than or equal to 0, we change the threshold to a small number varying around 0. Such a threshold is a fixed constant δ multiplied by a measure of the scale, chosen to be the magnitude of the smallest angle margin. The normalized one-step A-ITR is then where δ is a tuning parameter around 0 and M(x) = | 〈W (k) , f(x)〉 | is the magnitude of the smallest angle margin (note that 〈W (k) , f(x)〉 is negative).

Tuning Procedure
In this paper, the estimation procedure involves two tuning parameters. The first one is the regularization parameter λ in (10) which appears in both the two-step and one-step methods. The second one is the normalization parameter δ in (11) for the one-step method only. We will tune these two parameters differently in two steps.
The first step is to tune λ. For each λ, the estimated solution is f. Then we define the corresponding single-treatment ITR as d f = argmax j 〈W j , f〉 and calculate its empirical average of the expected outcome (1), which is given by . We choose the λ that yields the smallest empirical risk for the resulting ITR, even if our ultimate goal is to obtain a set-valued A-ITR. This can substantially simplify the tuning process. We found that other more complicated tuning procedures have led to a similar performance.
For the one-step method, we need to continue to tune δ. For the same λ (same resulting ITR), because different δ's may lead to slightly different set-valued A-ITRs and recommendations with different carnalities, we must actually compare the resulting A-ITRs to choose the best δ, instead of using the ITR as a proxy. However, there are some difficulties in evaluating the performance of the estimated A-ITR. Compared to the conventional ITR, the challenge here is that when the recommendation includes two or more treatment options, there are multiple potential outcomes and it is difficult to quantify the "overall" benefit for such a recommendation.
Although the proposed optimal A-ITR ϕ* defined in (5) is not a Bayes rule under any loss function, we can consider a closely related loss function, whose risk function is given by where ϕ : X 2 A ∖ ∅ is a set-valued predictor and | · | denotes the cardinality of a set.
Intuitively, ϕ + (x) is an optimal set of treatments selected to minimize the "average" clinical outcome with a penalty on the cardinality of the recommendation set. Note that when c = 1, ϕ + is the same as the optimal ITR d*. Moreover, when k = 2, ϕ + is the same as the optimal A-ITR ϕ*, as shown above. When k ≥ 3, ϕ + and ϕ* are different but are nested within each other in the following way: if we let S t * = {x ∈ X | | ϕ * (x) | ≤ t} and S t + = {x ∈ X | | ϕ + (x) | ≤ t}, then we have S 1 * = S 1 + , and S t * ⊆ S t + for t = 2,…, k − 1. Figure   3 demonstrates their relationship when k = 3.
From Figure 3, we observe that the regions with only one treatment are the same (S 1 * = S 1 + ), while the regions containing two or three treatments are slightly different. In general, the boundaries between the size-1 decisions and their complements are the same for the two rules ϕ + and ϕ*. They only differ in the boundaries between recommendations with different cardinalities (for example, the boundary between size-2 decisions and size-3 decisions). Although ϕ* does not directly minimize the weighted outcome defined in (12), the similarity between ϕ + and ϕ* justifies the use of the weighted outcome (12) as a new criterion for the tuning parameter selection. Specifically, we choose the δ value that can yield the smallest value of the following empirical counterpart of (12), In addition to the tuning parameter selection, we may also use this criterion to select different methods for conducting A-ITRs. In the real data analysis, we will use this criterion to select between the two proposed classification-based methods.

Simulation Studies
In this section, we study the numerical performance of the proposed methods.

Comparing Set-valued Recommendations
For two ITRs d 1 and d 2 , we can compare them by evaluating the expected outcome defined in (1). However, for two A-ITRs ϕ 1 and ϕ 2 , it is difficult to quantify which one is better due to the fact that a measure for the overall benefit is not well defined when multiple treatments are recommended. Although in Section 4.2 we have proposed the weighted expected outcome (12) for evaluating two A-ITRs, the optimal A-ITR ϕ + under this new criterion is still different from the desired near-optimal recommendation set ϕ*. So in the simulation studies, in addition to the empirical weighted outcome (13), we consider another means to compare different A-ITRs, using the expected outcome of the best and the worst treatments among the treatments that are recommended, averaged over a set of observations. We conduct such an evaluation for different types of recommendations separately to see how the A-ITR performs differently on them. Based on the size of the true optimal A-ITR ϕ*, we split the covariate space X into three regions corresponding to three kinds of recommendations: R 1 = {only one treatment is suggested} = {x ∈ X | | ϕ * (x) | = 1}, R 2 = {more than one treatment but not all of them are suggested} = {x ∈ X | 1 < | ϕ * (x) | < k}, R 3 = {all treatments are suggested} = {x ∈ X | | ϕ * (x) | = k} .
Note that R 1 , R 2 and R 3 are disjoint and X = R 1 ∪ R 2 ∪ R 3 . When c = 1, ϕ* is the optimal ITR d* and X = R 1 . When c > 1, we may have non-empty regions R 2 and R 3 .
For two A-ITRs ϕ 1 and ϕ 2 , we will compare them separately on R 1 , R 2 and R 3 . In each region, since multiple treatments may be suggested, we can compare the expected minimal outcome and the expected maximal outcome that they may lead to. Recall Y*(j) is the potential outcome by taking treatment j. Mathematically, we consider a performance interval, where the first quantity indicates the expected outcome if one can always use the best treatment within the recommended set ϕ(x) and the second quantity represents the worst situation, i.e., how bad it can be if one always chooses the worst treatment among the recommended options. Note that on R 1 , the two quantities are the same under ϕ* since only one treatment is recommended. As we increase c, we expect that this interval becomes wider on R 2 and R 3 since the diversity of the recommended options increases. From the definition of this interval, we claim that ϕ 1 is better than ϕ 2 if both the lower and the upper limits of this interval under ϕ 1 are smaller than their counterparts under ϕ 2 , on each region.

Results
We consider three simulation examples. For each example, we let X be uniformly sampled from X = [0, 1] 5 and X = [0, 1] 10 respectively. For simplicity, we assume A ⊥ X and p(A|X) = 1/k, and let Y = μ A (X)+ϵ where ϵ ~ N(0, 1/2). In each case, we let the training sample size to be n = 500, 1000, 2000, and use a test set with sample size 1000 to evaluate the performance. We compare three methods, namely, the regression-based method, the two-step classification-based method with squared loss, and the one-step classification-based method with the bent hinge loss. For each method, we output both ITR and A-ITR with c = 1.2.
Finally, we repeat each simulation 100 times and report the averages.
Example 1: This is an example with three treatments, where two conditional mean outcome functions are polynomial and the other is linear. Specifically, we have μ 1 (X) = 1 + 3X 1 2 + 3X 2 2 , μ 2 (X) = 3 − 0.5X 1 2 + 0.5X 2 2 , and μ 3 (X) = 3 + X 1 + X 2 . The upper panel in Figure 4 shows the true boundaries for the three treatments. We use polynomial kernel for both the two-step and one-step methods. The tuning parameter λ is chosen from 5 −6 to 5 2 .
Specifically, treatment 2 and 4 are dominated by treatments 1 and 3 and the optimal ITR should only output either 1 or 3. However, in certain regions treatment 2 and 4 still produce fairly good outcomes which can only be captured by A-ITR. The lower panel in Figure 4 shows the true boundaries. For the two-step method, we report the results using Gaussian kernel. For the one-step method, we report the results with polynomial kernel. The tuning parameter λ is chosen from 5 −9 to 5 −1 .
Example 3: This is an example where Assumption 8 is violated.
We note that the performance intervals for A-ITR always cover the expected outcomes of the single-valued ITR. This implies that by applying our proposed A-ITR framework, patients will potentially get a much better outcome as long as they are willing to consider other equally effective options identified by the A-ITR. Even if the patient does not choose the best option within the recommendation set, the worst case is not too bad and the ratio of its outcome to that of the best option is about c if the A-ITR is accurately estimated.
We compare different methods by inspecting the length and location of the A-ITR performance interval. Recall that the A-ITR with the shortest interval, the smallest lower limit, and the smallest upper limit on each region is the best A-ITR. However, since R 3 is the region where all treatments are near the optimality, different recommendations are expected to perform similarly. Hence we focus on regions R 1 and R 2 for the purpose of comparison.
From Table 1, we note that the regression-based A-ITR, though has the smallest lower limit in some cases, also has a longer interval in most cases, suggesting that the treatment could either go really well or really badly. This implies that the regression-based A-ITR method tends to include ineffective treatments into the near-optimal set. Part of the reason may be that the regression-based method has not accurately estimated each of the three or four potential outcome functions.
For the classification-based A-ITRs, the lower limits are roughly the same between the onestep method and the two-step method; however, the one-step method has shorter intervals in most cases. This means that the one-step method is better at excluding ineffective treatment options from the recommendation than the two-step method. This is true even when Assumption 8 is violated (Example 3). In addition, the one-step method also has the smallest expected weighted outcome (shown in the "All" column). However, the performance of both the one-step method and the two-step method becomes worse as the sample size decreases (see supplementary material), or the number of covariates increases. This may be due to the inefficiency of using the inverse probability weighting in the OWL framework.

Real Data Analysis
In this section, we apply our proposed A-ITR framework to a Type 2 diabetes mellitus (T2DM) observational study. The data set contains 1139 patients. Every patient was assigned one out of four diabetes treatments, which are GLP-1 receptor agonists alone, long-acting insulin alone, intermediate-acting insulin alone, and insulin regimens including short-acting insulin. The endpoint is the change of hemoglobin A1c level before and after the treatment, which is denoted by ΔHbA1c. In practice, if the treatment works, this value is usually negative (meaning that the hemoglobin A1c level decreases). The smaller ΔHbA1c is, the more effective the treatment is.
We first preprocess the original data set. Among the 19 covariates, we exclude those with a large proportion of missing values and with extremely imbalanced categories. We then impute the rest of them using the predictive mean matching method (Van Buuren, 2018). There are 10 covariates left after the preprocessing: gender, diabetic retinopathy, diabetic neuropathy, age, weight, body mass index (BMI), baseline hemoglobin A1c level, baseline high-density lipoprotein cholesterol (HDL), baseline low-density lipoprotein cholesterol (LDL), and heart disease.
For the outcome variable ΔHbA1c, we can reduce its variability by subtracting an estimate of its conditional mean E(ΔHbA1c | x) to make the estimation of f more robust . Here we use the ordinary least square regression to estimate E(ΔHbA1c | x). Denote the estimated mean function fitted by regression as m(x), we then observe that ΔHbA1c − m(x) can be positive or negative. We perform an exponential transformation to make it positive, which also justifies the use of ratio μ j /μ (1) to determine the near-optimal recommendation set. Specifically, we let Y = exp ((ΔHbA1c − m(x)) ∕ 5).
If we further assume conditional normality for ΔHbA1c given X and treatment j, with mean ν j (x) ≡ E(ΔHbA1c | X = x, A = j) and equal variance across treatments, then Y | (x, j) follows a log-normal distribution with mean proportional to exp(v j (x)/5). Then the optimal A-ITR is, In this study, we choose the near-optimality constant c = 1.2, so that 5 log c ≈ 0.9. This implies that the near-optimal recommendation set is constructed by including all treatments with conditional means ΔHbA1c within 0.9 of the optimal treatment.
We compare the performance of the regression-based method, the two-step method, and the one-step method. For both classification-based methods, we estimate the propensity score p(A|X) using logistic regression. Each method leads to a single-valued ITR and a set-valued A-ITR. In Table 2, we compare the different recommendations using the 5-fold cross-validated empirical weighted outcome defined in (13).
From Table 2, we observe that the one-step method with Gaussian kernel has the best weighted outcome. To illustrate the resultant A-ITR, we split the data into a training set (70%) and a test set (30%). We fit the training set using the one-step method with Gaussian kernel and then construct the recommendation set for patients in the test set. In our analysis, no patient is recommended to take the intermediate-acting insulin and the majority of patients are recommended to choose between the short-acting insulin and GLP-1. Specifically, 55% of patients are recommended the short-acting insulin only, 8% are recommended GLP-1 only, and 24% are recommended to take either one of the two. For the remaining 13% of patients, they are all recommended to take the long-acting insulin, including 1% who are suggested to take either the long-acting insulin or GLP-1, 5% who are suggested to take either the long-acting insulin or the short-acting insulin, and 7% whose only option is the long-acting insulin. We visualize the predicted treatments in Figure 5.
From Figure 5, we can see that age and BMI are two useful biomarkers in constructing the near-optimal recommendation set. In fact, by comparing the left panel and the right panel of Figure 5, we observe that BMI behaves like the first principal component (PC1) while age behaves like the second principal component (PC2). Figure 5 suggests that for patients without obesity (BMI less than 30), younger patients should take the long-acting insulin while older patients should take GLP-1. The short-acting insulin, on the other hand, serves as an "universal" treatment that many patients can take as an alternative, and is especially effective for overweighted patients.

Statistical Learning Theory
In this section, we study the convergence rate of the excess ℓ-risk in both linear learning and kernel learning settings. We assume the random vector Z = (X, A, Y) follows a certain distribution P that satisfies Assumption 1. Furthermore, we make an additional assumption.
Assumption 10 There is a constant C > 0 such that |Y/p(A|X)| ≤ C holds. For simplicity, we set C = 1 through out this section.
For f and f′, two (k − 1)-dimensional functions, and ℓ, an increasing, convex and Lipchitz loss function, denote We call e ℓ (f, f*) the excess ℓ-risk of f if f* is optimal within a certain function space ℱ.

Linear Learning
We first consider the linear function space, that is, we assume f = (f 1 ,…, f k−1 ) T with f j (x) = x T β j for j = 1,…, k − 1. For simplicity, we assume each covariate is bounded by [0, 1].
In Theorem 12, δ n stands for the approximation error between the optimal f in ℱ(p n , s n ) and the optimal f in ℱ(p n ). So if s n → ∞, δ n converges to 0. On the other hand, the first term O c(p n s n ) 1 ∕ 2 τ n log τ n −1 is the estimation error between f and f (p n ,s n ) , and as we increase s n , it becomes larger. The optimal tuning parameter s n is then chosen such that c(p n s n ) 1 ∕ 2 τ n log τ n −1 ∼ δ n .
In Theorem 12 we may allow s n → ∞ with an appropriately chosen rate. The reason is that when we include diverging number of covariates, i.e., p → ∞, f (p) can become more complicated, and thus we need a larger s n to accommodate this change. However, in practice it may not be necessary since the true model usually depends on a finite number of covariates. So we could simplify Theorem 12 if we make the assumption that there is a finite s* such that f (p) ∈ ℱ(p, s * ) for all p. For example, suppose f j * (x) = f j (p) (x) = ∑ l = 1 m x l β lj * for j = 1,…, k − 1 and all p = m,…, ∞. Then we can choose s * = ∑ j = 1 k − 1 ∑ l = 1 m | β lj * | 2 .
The convergence of excess ℓ-risk e ℓ (f, f * ) in Corollary 13 requires that p n = o(n).
Particularly, when p n grows no faster than n 1−r , where 0 < r < 1, it can be verified that the error rate is at an order of no greater than n −r ∕ 2 (log n) 3 ∕ 2 . This result is consistent with most of the classical asymptotic theory that the dimension of covariates should not be greater than the number of observations. Furthermore, we observe that if p n = O(1), then e ℓ (f, f * ) = O(n −1 ∕ 2 log n), which is almost O(n −1/2 ).

Kernel Learning
Next we discuss the convergence rate of excess ℓ-risk for kernel learning. We denote f = (f 1 ,…, f k−1 ) T to be a function in a reproducing kernel Hilbert space (RKHS) H with kernel function K(·, ·). Then by the RKHS theory, we can write f j (x) = ∑ i = 1 n K(x i , x)α ij + α 0j for j = 1,…, k − 1. To develop the theory for the proposed methods, we still need one more assumption.

Assumption 14
Suppose H is a separable RKHS equipped with kernel function K(·, ·). There exists a positive number B, such that K(x, x′) ≤ B for any x, x′ ∈ X.
Assumption 14 states that the RKHS is separable and the kernel function is bounded. This is true for many commonly used kernel functions. For example, for the Gaussian kernel, we may take B = 1. We define the function space as ℱ(n, s) = {f = (f 1 , …, f k − 1 ) T | f j (x) = ∑ i = 1 n K(x i , x)α ij + α 0j , j = 1, …, k − 1, J(f) ≤ s}, where J(f) = ∑ j = 1 k − 1 α j T Kα j + ∑ j = 1 k − 1 α 0j 2 and K is the gram matrix. Recall we have included intercepts in the penalty for simplicity. Let ℱ(∞) = lim n ∞ ∪ 0 ≤ s < ∞ ℱ(n, s), and define The following theorem gives the convergence rate of e ℓ (f, f (∞) ) when s = s n grows with n.
Similar to the linear case, there is a trade-off between the approximation error δ n and the estimation error O cB(s n ∕ n) 1 ∕ 2 log n in Theorem 15, and the optimal tuning parameter s n is determined roughly when cB(s n ∕ n) 1 ∕ 2 log n ∼ δ n .
Compared to Theorem 12, the excess ℓ-risk for RKHS learning seems to yield a faster rate. However, this is not always truly the case due to Assumption 14, which requires a bounded kernel function, and implies a restriction on the number of covariates p. For example, for linear kernel we have K(x, x′) = x T x′ ≤ p under Assumption 11. For Assumption 14 to be true, we have to let p = O(1). In this case both convergence rates are e ℓ (f, f (p) ) = O((s n ∕ n) 1 ∕ 2 log n); that of the kernel learning is no faster than that of the linear learning. In general, to obtain a faster rate than that of the linear learning, we need a kernel function that does not increase in p, such as the Gaussian kernel.

Conclusion and Discussions
In this work, we propose a new individualized treatment recommendation framework, named A-ITR, that has the capacity to recommend to patients near-optimal treatment options in terms of their clinical outcomes. By adopting the A-ITR, patients have the opportunity to choose the treatment options tailored for their different financial situations, personal preferences, and lifestyle choices. To estimate the optimal A-ITR, we proposed two classification-based methods based on the OWL framework. We also provide a new evaluation criterion suitable for A-ITRs, namely the weighted expected outcome, defined in (12). The simulation study shows the usefulness of this new criterion in parameter tuning and model selection.
The proposed methods may be subject to misuse when caution is not exercised with regard to the choice of the near-optimality constant c. Typically, c is chosen based on the physicians' experience on what defines near optimality for a particular outcome. When possible, the choice could be made more objective by using a measure of the variability of the outcomes for the top treatments. Lastly, recommendations from a variety of c values could be presented to the patients for a final selection, as long as any possible sacrifice of the outcome can be clearly explained to the patients.
There are several possible directions for future works. Firstly, the current A-ITR estimation is based on the OWL framework, which may be sensitive to the estimated propensity score. In particular, the OWL estimator is known to suffer a large variance in practice . To address this issue, one may consider applying the recently proposed augmented OWL framework (Zhao et al., 2019;Huang et al., 2019) to improve the finite sample performance. These methods typically have a double robustness property so that the efficiency of the estimator can be further improved. Secondly, when applying the A-ITR framework in practice, it may be desirable to adjust c for different needs or preferences.
A natural generalization of the method is to incorporate the patients' preferences on multiple outcomes into the framework. For example, we can consider A-ITRs with an additional competing outcome as a secondary endpoint , or A-ITR with additional safety endpoints formulated as constraints . Thirdly, besides the kernel method, we can consider other learning algorithms to estimate f within the A-ITR framework. Finally, the current method assumes the outcome Y is continuous. We can consider nontrivial extensions to other types of outcome such as count outcomes, survival outcomes (   ITR (top) and A-ITR (bottom) for a toy example given by Bayes rule, the two-step estimator, and one-step estimator. Yellow triangles indicate recommendations with two options, and black squares indicate three options. Both estimators give similar results to the Bayes rule in terms of the single-valued ITR. The two-step method does not provide as good A-ITR results as the one-step method. Comparison between ϕ* and ϕ + (left: ϕ* or ϕ + with c = 1; middle: ϕ* with c = 1.2; right: ϕ + with c = 1.2). Any point in the plot represents (μ 1 , μ 2 , μ 3 ) (suppose Y*(j) ∈ (0, 1)) with the recommendation illustrated by colors. Points in the red, green, and blue regions contain only one treatment; the yellow region contains two treatments; and the purple region includes all three treatments. S 1 * = S 1 + (unions of red, blue and green regions), and S 2 * ⊆ S 2 + (all but the purple regions).  The true ITR and A-ITR for observations in Examples 1 and 2 with recommendation types shown in colors. The orange, blue and green dots indicate that only one single treatment is suggested, which by definition is R 1 , while the yellow triangles indicate R 2 , and the black squares indicate R 3 . The upper panel is Example 1, the lower panel Example 2. The left panel is ITR, and the right panel is A-ITR (c = 1.2).

Figure 5:
Predicted treatment(s) for patients with recommendations given by different colors. The data set is projected on the first two principal components (left panel), and two particular covariates, age and BMI (right panel). Concentric circles indicate multiple treatments recommended to the same patient. Results of the simulation studies with n = 1000. In each region, the expected outcome for ITR and the outcome interval for A-ITR (c = 1.2) are reported. The empirical weighted outcome defined in (13) is shown in the "All" column. Each number is averaged over 100 replications. In each case, the best performing method is marked in bold.  The mean 5-fold cross validated weighted outcome and its standard error (in the parenthesis) over 100 replications for T2DM data. The method that yields the best result is marked in bold.